SPEECH MODEL TRAINING TECHNIQUE FOR SPEECH RECOGNITION



BACKGROUND OF THE INVENTION 

Field of the Invention 

The invention relates to a training technique for speech recognition and,
more particularly, to a speech model training technique that achieves a high
recognition rate in a noisy environment.

Description of the Related Art 

In recent years, the manufacturing techniques for electronic products have
merged with those for information and communication products, and networks
link all of these products together. This progress has produced an automated
living environment that makes daily life and work more convenient. As a
result, a user is able to use a speech recognizer in various environments
through different communication products. However, because the noise in such
environments varies, the recognition rate of a speech recognition device
deteriorates as the noise changes.

There are two stages in speech recognition: a training stage and a
recognition stage. During the training stage, different voices are collected
first, and a speech model is then generated by statistical methods. After
that, the speech model is applied to a learning procedure so that the speech
recognition device acquires the capability to learn. The speech recognition
capability of the device can then be enhanced through iterative training and
through recognition by matching. The training technique used to build the
model therefore significantly affects the recognition ability of the speech
recognition device.
Conventional speech training techniques fall into two categories:
Discriminative Training (hereinafter referred to as DT) and Robust Training
(Robust Environmental-effects Suppression Training, hereinafter referred to
as REST). The DT technique employs a statistical method to collect
homogeneous phonetic signals that are easily confused. During training, these
homogeneous speech training data are taken into consideration to generate a
model with high discriminative capability. The DT technique learns clean
speech efficiently in a quiet environment, but it functions less efficiently
in a noisy environment. In addition, a speech model generated by the DT
technique in a noisy environment tends to over-fit and to lack generalization
capability. That is, the DT model adapts to one particular noisy environment,
and when that environment changes, the recognition performance can drop
sharply. Unlike the DT technique, the REST technique statistically estimates
the homogeneous phonetic information and suppresses the environmental effects
to make speech recognition more robust. However, robust as the REST technique
is, its discriminative capability is weaker than that of the DT technique.



Therefore, focusing on the aforementioned problems, the invention 
provides a speech model training technique for speech recognition that 
possesses both discriminative capability and robust capability in a noisy 
environment. 


SUMMARY OF THE INVENTION 

The first and main object of the invention is to provide a speech model
training technique for speech recognition that first employs the REST
technique to separate the environmental effects residing in the inputted
speech and then trains the remaining clean speech with the DT technique, so
that the resulting speech training model possesses both robust capability and
discriminative capability. This resolves the conventional problem that a
single technique cannot provide both capabilities at once, and it enhances
the recognition rate as well.

The second object of the invention is to provide a speech model training
technique for speech recognition that is suitable for compensation-based
recognition in a noisy environment, so as to enhance the speech recognition
rate in such an environment.

The third object of the invention is to treat each voice effect in the
inputted speech as an individual effect and to separate it individually, so
that each distortion effect can be isolated and the environmental effects can
be controlled precisely.

According to the invention, a speech model training technique for speech
recognition includes the following steps: first, the inputted speech is
separated into a compact speech model of clean voice and an environmental
interference model; next, according to the environmental interference model,
the environmental effects in the inputted speech are filtered out to obtain a
phonetic signal; finally, the DT algorithm is applied to the phonetic signal
and the compact speech model to obtain a compact speech training model with
high discriminative capability, which is provided to the speech recognition
device for subsequent speech recognition.

The objects and technical contents of the invention will be better
understood through the description of the following embodiments with
reference to the drawings.



BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1(a) and FIG. 1(b) are schematic diagrams showing the structure of the
speech model training technique of the invention.

FIG. 2 is a schematic diagram showing a comparison of recognition results
between the training technique of the prior art and the training technique of
the invention.

FIG. 3 is a schematic diagram showing another comparison of recognition
results between the training technique of the prior art and the training
technique of the invention.



DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

The speech model training technique of the invention first employs the
REST technique to separate the inputted speech into a compact speech model
and an environmental interference model, so that the compact speech model can
be used as a seed model for model compensation. The DT algorithm is then
applied to obtain a speech training model with high discriminative
capability, which is provided to the speech recognition device for subsequent
speech recognition.

FIG. 1(a) and FIG. 1(b) are schematic diagrams showing the structure of the
speech model training technique of the invention. As shown in FIG. 1(a), the
REST algorithm (1) is first applied to the inputted speech Z to model and
separate a compact speech model $\Lambda_x$ and an environmental interference
model $\Lambda_e$. The signals of the environmental interference model
$\Lambda_e$ include channel signals and noises. Well-known examples of
channel signals are the microphone effect and speaker bias. Next, as shown in
FIG. 1(b), the environmental interference model $\Lambda_e$ is used to
suppress the environmental interference in the inputted speech Z so as to
obtain a speech signal X. The environmental interference is usually filtered
out by means of a filter. Finally, the generalized probabilistic descent
(GPD) training scheme of the DT technique is applied to the speech signal X
and the compact speech model $\Lambda_x$, which has undergone
environmental-effects suppression. This calculation yields a compact speech
model $\Lambda_x'$ with high discriminative capability.

After the algorithm of the invention has been applied and the compact
speech model $\Lambda_x'$ with high discriminative capability has been
obtained, the speech recognition device uses, during the recognition stage, a
method that combines parallel model combination (PMC) with recognition
through signal bias compensation, usually referred to as PMC-SBC (see the
appendage 1), so that the speech model $\Lambda_x'$ can be compensated to
match the current operational environment before the recognition procedure is
carried out. The PMC-SBC method proceeds as follows. First, the non-speech
frames are detected by comparing the non-speech output of a Recurrent Neural
Network (RNN) with a predetermined threshold; these frames are used to
calculate the on-line noise model. Next, the state-based Wiener filtering
method, which relies on the properties of stationary random processes and on
the spectral characteristics of the signal, is employed to filter the noisy
signal, so that the r-th utterance of the inputted speech, referred to as
$Z^{(r)}$, is processed into an enhanced speech signal. The utterance
$Y^{(r)}$ of the enhanced speech signal is then converted into the cepstrum
domain, where the channel bias is estimated by the signal bias removal (SBR)
method. The SBR method estimates the bias by first encoding the feature
vectors of the enhanced speech using a codebook and then calculating the
average encoding residual. The codebook is formed by collecting the mean
vectors of the mixture components in the compact speech models. The channel
bias is then used to convert all the speech models $\Lambda_x'$ into
bias-compensated speech models. Afterwards, these bias-compensated speech
models are further converted, by means of the PMC method and the on-line
noise model, into noise- and bias-compensated speech models. Finally, these
noise- and bias-compensated speech models are used for the subsequent
recognition of the inputted utterance $Z^{(r)}$.
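
The bias estimation step described above, in which each feature vector is
encoded with a codebook and the encoding residuals are averaged, can be
sketched as follows. This is a minimal single-pass illustration under stated
assumptions, not the implementation of the invention: the function names, the
squared Euclidean distance used for encoding, the additive shift applied to
the mean vectors, and the toy numbers are all hypothetical.

```python
def nearest_codeword(frame, codebook):
    """Return the codeword closest to `frame` (squared Euclidean distance, assumed)."""
    return min(codebook, key=lambda c: sum((f - w) ** 2 for f, w in zip(frame, c)))


def estimate_channel_bias(frames, codebook):
    """Estimate the bias as the average encoding residual over all frames."""
    dim = len(codebook[0])
    total = [0.0] * dim
    for frame in frames:
        code = nearest_codeword(frame, codebook)   # encode the frame
        for d in range(dim):
            total[d] += frame[d] - code[d]         # accumulate the residual
    return [t / len(frames) for t in total]


def compensate_means(means, bias):
    """Shift every mean vector by the bias (assumed form of bias compensation)."""
    return [[m + b for m, b in zip(mean, bias)] for mean in means]


# Toy data: two codewords and three frames that carry a constant offset.
codebook = [[0.0, 0.0], [4.0, 4.0]]
frames = [[0.5, -0.5], [4.5, 3.5], [0.5, -0.5]]
bias = estimate_channel_bias(frames, codebook)
compensated = compensate_means(codebook, bias)
```

In this sketch the codebook plays the role of the collected mixture mean
vectors, and `compensate_means` stands in for the conversion of the speech
models into bias-compensated speech models.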

The speech model training technique of the invention can be applied to a
device with a speech recognizer, such as a car speech recognizer, a PDA
(Personal Digital Assistant) speech recognizer, or a telephone/cell-phone
speech recognizer.

To sum up, the invention separates the noises in the inputted speech by
using the REST technique and then trains the clean speech by using the DT
technique. By integrating the REST and DT techniques, the compact speech
training model provided by the invention not only possesses both robust
capability and discriminative capability but is also adaptable to
compensation-based recognition in a noisy environment. In addition, because
the learning technique provided by the invention separates each voice effect
in the inputted speech individually, each distortion effect is separated
individually as well. The learning technique can therefore be applied to the
selective control of environmental-effect signals, for instance, controlling
the environmental effects on speech or the adaptability of a speech model.

So far, the algorithm of the invention has been described theoretically. In
the following, a practical embodiment is illustrated in detail to verify the
algorithm of the invention. The algorithm of the invention combines
discriminative and robust training algorithms and is referred to as D-REST
(Discriminative and Robust Environment-effects Suppression Training)
hereinafter. The D-REST algorithm presumes a noisy speech realization model
in which the homogeneous, clean speech $X^{(r)}$ passes through the noisy
speech model to yield $Z^{(r)}$, where $Z^{(r)}$ represents the speech
feature vector sequence of the r-th utterance. Consider the set of
discriminative functions $\{g_i,\ i = 1, 2, \ldots, M\}$ with the
environment-compensated speech HMMs (Hidden Markov Models) $\Lambda_Z^{(r)}$
of $Z^{(r)}$, defined by

$$
\begin{aligned}
g_i(Z^{(r)}; \Lambda_Z^{(r)}) &= \log[\Pr(Z^{(r)}, U_i^{(r)} \mid \Lambda_Z^{(r)})] \\
&= \log[\Pr(Z^{(r)}, U_i^{(r)} \mid \Lambda_x \otimes \Lambda_e)] \qquad (1)
\end{aligned}
$$

where $U_i^{(r)}$ is the maximum likelihood state sequence of $Z^{(r)}$ for
the i-th HMM of $\Lambda_Z^{(r)}$; $\Lambda_x$ denotes the set of
environment-effects suppressed HMMs (i.e., the compact speech model); and
$\Lambda_e$ is the set of environmental interference models.

The symbol $\otimes$ denotes the model compensation operator, which is also
employed in the recognition process.
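
As an illustration of the discriminative function defined above, the score of
one model is the log-probability of the observation sequence along its
maximum likelihood state sequence, which a Viterbi search provides. The
sketch below assumes a discrete-observation HMM for brevity (the source does
not specify the emission type), and `viterbi_log_score` and the toy
parameters are hypothetical names and values chosen for the example.

```python
import math


def viterbi_log_score(obs, log_init, log_trans, log_emit):
    """Return (best log-probability, best state path) for a discrete HMM.

    obs: list of observation symbol indices.
    log_init[s], log_trans[s][t], log_emit[s][o]: log-probabilities.
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[s][obs[0]] for s in range(n_states)]
    backptr = []
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for t in range(n_states):
            # Best predecessor state for landing in state t at this frame.
            best = max(range(n_states), key=lambda s: prev[s] + log_trans[s][t])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][t] + log_emit[t][o])
        backptr.append(ptr)
    # Trace the maximum likelihood state sequence backwards.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path


# Toy two-state model (all numbers are made up for illustration).
log = math.log
log_init = [log(0.6), log(0.4)]
log_trans = [[log(0.7), log(0.3)], [log(0.4), log(0.6)]]
log_emit = [[log(0.9), log(0.1)], [log(0.2), log(0.8)]]
score, path = viterbi_log_score([0, 0, 1], log_init, log_trans, log_emit)
```

Evaluating `viterbi_log_score` once per compensated model would give the M
scores that the set of discriminative functions compares.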

The goal of the D-REST algorithm is to estimate $\Lambda_x$ and $\Lambda_e$
with a set of discriminative functions $\{g_i,\ i = 1, 2, \ldots, M\}$, and
to make $\Lambda_x$ a robust and discriminative seed model for model
compensation-based noisy speech recognition.

The first stage of the D-REST algorithm is to concurrently estimate the
compact speech models $\Lambda_x$ and the environmental interference models
$\Lambda_e$. Assume that the environmental effects comprise a channel b and
an additive noise n on each utterance. Let
$\Lambda_e = \{b^{(r)}, \Lambda_n^{(r)}\}_{r=1,\ldots,R}$ denote the set of
environmental interference models of the whole training data set, where
$b^{(r)}$ and $\Lambda_n^{(r)}$ are, respectively, the signal bias and the
noise model of the r-th training utterance. Based on the ML (maximum
likelihood) criterion, the goal is to jointly estimate $\Lambda_x$ and
$\Lambda_e$ from the given $\{Z^{(r)}\}_{r=1,\ldots,R}$ by

$$
(\Lambda_x, \Lambda_e) = \arg\max \Pr(\{Z^{(r)}\}_{r=1,\ldots,R} \mid \Lambda_x, \Lambda_e) \qquad (2)
$$



During the iterative training procedure, the REST technique is employed
sequentially to optimize Equation (2) through the following three
operations: (1) form the compensated HMMs $\Lambda_Z^{(r)}$ by using the
current estimate $\{\Lambda_x, \Lambda_e\}$ and use them to optimally segment
the training utterance $Z^{(r)}$; (2) based on the segmentation result,
estimate $\Lambda_n^{(r)}$ and enhance the adverse speech $Z^{(r)}$ to obtain
$Y^{(r)}$, and then estimate $b^{(r)}$ and further enhance the speech
$Y^{(r)}$ to obtain $X^{(r)}$; (3) update the current speech HMM models
$\Lambda_x$ using the enhanced speech $\{X^{(r)}\}_{r=1,\ldots,R}$.
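
The alternation among the three operations above can be sketched with a
heavily simplified toy. The assumptions are substantial and are not taken
from the source: features are scalars, the compact model is reduced to a list
of state means, segmentation is a nearest-mean assignment rather than an HMM
alignment, and the additive-noise model is omitted so that only a
per-utterance bias is estimated. The function name `rest_toy` and the data
are hypothetical.

```python
def rest_toy(utterances, means, n_iter=5):
    """Alternate segmentation, bias estimation, and model update (scalar toy)."""
    means = list(means)
    biases = [0.0] * len(utterances)
    for _ in range(n_iter):
        sums = [0.0] * len(means)
        counts = [0] * len(means)
        for r, frames in enumerate(utterances):
            # (1) Segment with the compensated model: state means shifted by the bias.
            states = [min(range(len(means)),
                          key=lambda k: (z - means[k] - biases[r]) ** 2)
                      for z in frames]
            # (2) Re-estimate this utterance's bias, then enhance its frames.
            biases[r] = sum(z - means[s] for z, s in zip(frames, states)) / len(frames)
            for z, s in zip(frames, states):
                sums[s] += z - biases[r]
                counts[s] += 1
        # (3) Update the compact model from the enhanced frames of all utterances.
        means = [sums[k] / counts[k] if counts[k] else means[k]
                 for k in range(len(means))]
    return means, biases


# Two toy utterances: the same two-state speech shifted by different biases.
toy_utterances = [[1.0, 11.0, 1.0, 11.0], [-2.0, 8.0, 8.0, -2.0]]
toy_means, toy_biases = rest_toy(toy_utterances, [0.0, 10.0])
```

Each of the three steps minimizes the same squared-error criterion with the
other quantities held fixed, which is the sense in which the toy mirrors a
sequential optimization; the actual procedure operates on HMMs under the
maximum likelihood criterion.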

Also, because the environment-effect compensation operation is involved in
the training process, better reference speech HMM models for the robust
recognition method can be expected. Moreover, the separate modeling of
$\Lambda_x$ and $\Lambda_e$ allows the training process to focus on modeling
phonetic variation without the unwanted influence of the environmental
effects.

The second stage of the D-REST algorithm is to perform discriminative
training with minimum classification error (MCE), based on the observed
speech Z and its environment-compensated speech HMM models $\Lambda_Z^{(r)}$.
The segmental GPD (generalized probabilistic descent)-based training
procedure (see the appendage 2) is adopted here, with the following
misclassification measure of $Z^{(r)}$:

$$
d_i(Z^{(r)} \mid \Lambda_Z^{(r)}) = -g_i(Z^{(r)}; \Lambda_Z^{(r)}) + g_k(Z^{(r)}; \Lambda_Z^{(r)}) \qquad (3)
$$

where $k = \arg\max_{j,\, j \neq i} \Pr(Z^{(r)}, U_j^{(r)} \mid \Lambda_Z^{(r)})$.
From Equation (3), and by assuming that Y (r) =Y . and that the state-based
Wiener filtering is the inverse operation of the PMC (see the appendage 3),
the $\Pr(Z^{(r)}, U_i^{(r)} \mid \Lambda_Z^{(r)})$ in Equation (1) can be
rewritten as:



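$$
\begin{aligned}
\Pr(Z^{(r)}, U_i^{(r)} \mid \Lambda_Z^{(r)}) &= p(Z^{(r)}, U_i^{(r)} \mid \Lambda_x \otimes \{b^{(r)}, \Lambda_n^{(r)}\}) \\
&= p(X^{(r)}, U_i^{(r)} \mid \Lambda_x)
\end{aligned} \qquad (4)
$$

so that Equation (3) can be expressed as:

$$
d_i(Z^{(r)} \mid \Lambda_Z^{(r)}) = d_i(X^{(r)} \mid \Lambda_x) \qquad (5)
$$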

Equation (5) shows that performing the MCE-based training on Z with the
environment-compensated HMM model $\Lambda_Z^{(r)}$ is equivalent to
performing the MCE-based training on the environment-effects suppressed
speech X with the given compact model $\Lambda_x$.
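
The misclassification measure used in this second stage can be illustrated
with a short sketch. The first function follows the measure defined above
(the negated score of model i plus the score of its best rival). The sigmoid
loss is the smooth error count commonly paired with MCE/GPD training in the
literature; its use here, the parameter `gamma`, the function names, and the
toy scores are assumptions rather than details stated in the source.

```python
import math


def misclassification_measure(scores, i):
    """d_i = -g_i + g_k, where k is the best-scoring model other than i."""
    best_rival = max(g for j, g in enumerate(scores) if j != i)
    return -scores[i] + best_rival


def sigmoid_loss(d, gamma=1.0):
    """Smooth 0-1 loss of the misclassification measure (assumed form)."""
    return 1.0 / (1.0 + math.exp(-gamma * d))


# Toy log-likelihood scores g_1..g_3 of one utterance under three models.
toy_scores = [-10.0, -12.5, -11.0]
d_correct = misclassification_measure(toy_scores, 0)  # model 0 scores best
d_wrong = misclassification_measure(toy_scores, 1)    # model 1 loses to model 0
```

A negative measure means that model i outscores every rival, so the loss
falls below one half; a positive measure indicates a misclassification and
pushes the loss toward one. By the equivalence stated above, the scores may
be computed from the suppressed speech X and the compact model rather than
from Z and the compensated model.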

Therefore, by implementing the foregoing speech model training technique, a
compact speech training model with high discriminative capability can be
obtained. The following description employs two embodiments to verify the
functions and efficiency of the invention. Referring to FIG. 2, the first
embodiment applies the D-REST technique of the invention, the generalized
probabilistic descent training technique of the prior art, and the REST
training technique in an in-car noisy environment with GSM (Global System for
Mobile Communication) transmission channels. The speech classification errors
at different signal-to-noise ratios are compared, with a control group that
uses the conventional HMM recognition technique without any noise model
compensation. The test results show that, whether the environment is clean or
highly noisy with a signal-to-noise ratio of 3, the in-car speech recognition
device attains the minimum classification error when it uses the D-REST
speech model training technique of the invention, and thus achieves the best
recognition performance among the compared techniques.

Another embodiment is shown in FIG. 3, in which the testing conditions and
targets are the same as those in the first embodiment. The only difference
between the two embodiments is that the car noise type of the training corpus
differs from that of the testing corpus. Nevertheless, the test results show
that the D-REST speech model training technique of the invention attains the
minimum classification error at every signal-to-noise ratio. By contrast, the
GPD training technique performs worse than the control group, because the
speech model it generates over-fits and lacks generalization. Even a slight
change in the testing environment therefore causes a serious drop in
recognition performance.

The embodiments above are only intended to illustrate the invention; they
are not intended to limit the invention to the specific embodiments.
Accordingly, various modifications and changes may be made without
departing from the spirit and scope of the invention as described in the
following claims.






